Development of the multivariate administrative data cystectomy model and its impact on misclassification bias

Background Misclassification bias (MB) is the deviation of measured from true values due to incorrect case assignment. This study compared MB when cystectomy status was determined using administrative database codes vs. predicted cystectomy probability. Methods We identified every primary cystectomy-diversion type at a single hospital 2009–2019. We linked to claims data to measure true association of cystectomy with 30 patient and hospitalization factors. Associations were also measured when cystectomy status was assigned using billing codes and by cystectomy probability from multivariate logistic regression model with covariates from administrative data. MB was the difference between measured and true associations. Results 500 people underwent cystectomy (0.12% of 428 677 hospitalizations). Sensitivity and positive predictive values for cystectomy codes were 97.1% and 58.6% for incontinent diversions and 100.0% and 48.4% for continent diversions, respectively. The model accurately predicted cystectomy-incontinent diversion (c-statistic [C] 0.999, Integrated Calibration Index [ICI] 0.000) and cystectomy-continent diversion (C:1.000, ICI 0.000) probabilities. MB was significantly lower when model-based predictions was used to impute cystectomy-diversion type status using for both incontinent cystectomy (F = 12.75; p < .0001) and continent cystectomy (F = 11.25; p < .0001). Conclusions A model using administrative data accurately returned the probability that cystectomy by diversion type occurred during a hospitalization. Using this model to impute cystectomy status minimized MB. Accuracy of administrative database research can be increased by using probabilistic imputation to determine case status instead of individual codes. Supplementary Information The online version contains supplementary material available at 10.1186/s12874-024-02199-1.


Introduction
Health administrative data (HAD) record information required for regulatory or financial purposes in health care systems.Most HAD-based research uses codes to identify diagnoses or procedures.However, these codes arenever perfectly accurate, and their use returns results that deviate from true values due to incorrect case status assignment.This deviation of measured from true values is termed "misclassification bias" [1].Misclassification independent of other variables is termed 'non-differential' and will bias association measures towards the null; misclassification that varies by other variables is termed 'differential' and can bias association measures with those variables in any direction.All misclassification can bias prevalence estimates in unpredictable directions.Despite it being widely known that codes can be erroneously applied, HAD-based studies often identify cases or outcomes using codes or code combinations with unknown accuracy [2].
HAD are appealing for studying cystectomy since this is a relatively uncommon procedure with notable process variation having considerable morbidity and mortality.Cystectomy with urinary diversion is the standard surgical treatment for both muscle invasive and high-risk non-muscle invasive bladder cancer.Following bladder removal, urinary diversion must be performed to maintain continuous urine drainage and preserve renal function.Urinary diversion is commonly performed using a segment of intestine to which the ureters are attached.This urinary diversion can be either continent or incontinent.A continent diversion usually includes a "neobladder" in which a segment of intestine is reconfigured in a spherical fashion and placed in the bladder's usual location that is attached to the urethra.Therefore, patients store urine in the new bladder and then voluntarily pass urine when able.If this is not possible, a continent cutaneous diversion is created wherein a "catheterizable" pouch is formed by attaching the neobladder to the abdominal wall with a continent stoma through which patients can insert a catheter to drain the stored urine.An incontinent diversion includes a urinary conduit created by attaching a segment of intestine to the abdominal wall which continuously drains into an adherent stoma appliance.Less commonly, the ureters can be attached directly to the abdominal wall to allow for continuous drainage (i.e.cutaneous ureterostomy).
Cystectomies are usually identified in HAD by diagnostic and procedural codes.Two studies have found that a diagnostic code for bladder cancer and a procedure code for radical cystectomy was predictive of radical cystectomy for cancer: Tan et.al [3].found that this combination had a sensitivity of 98.8% (95%CI 93.5-100.0)and a positive predictive value of 93.3% (85.9%-97.5); in an internal validation cohort, Lyon et.al [4].measured a sensitivity of 97.0% (95%CI 93.9-98.8)and positive predictive value of 95.0% (91.4-97.4).However, neither of these code algorithms identified non-oncological radical cystectomy, was externally validated, or measured misclassification bias when cystectomy status was determined using these codes.In addition, these code algorithms did not distinguish between cystectomies with incontinent or continent diversions.This is important since cystectomy with continent diversion is more surgically complicated and has unique outcomes.
We were skeptical that cystectomies with different diversion types could be accurately distinguished in HAD with codes since these procedures are very similar.However, several studies have shown that multivariable prediction models using variables from HAD can accurately predict the probability of particular diagnoses or procedures [5][6][7][8].These studies have also found that using these predicted probabilities to impute case status reduces misclassification bias compared to using codes alone.
In this study, we identified every person who had a cystectomy as a primary surgery for any indication at our hospital over a 10-year period.We then measured the association of cystectomy with 30 variables (reference standard).These association measures were repeated after determining cystectomy status using codes and quantified misclassification bias by compared these association measures to reference standard values.Finally, we created a HAD-based multivariable model that predicted cystectomy probability by urinary diversion type.We then used these probabilities to impute cystectomy status, measured associations of cystectomy with the 30 variables, and compared misclassification bias to that when cystectomy status was determined using codes.Finally, misclassification bias when determining cystectomy status based on codes or the model-based probability was compared.

Study setting
The study took place at The Ottawa Hospital (TOH), a 1000-bed teaching hospital with two campuses that is the tertiary referral and trauma center for a region of approximately 1.3 million people.Annually, TOH has more than 175,000 emergency department visits, 40,000 nonpsychiatric admissions, and 50,000 surgical cases.During the study period, more than 95% of cystectomies in the region were conducted at TOH; near the end of the study period, all such procedures were performed at TOH.

Ethics approval and consent to participate
The study was approved by the Ottawa Health Science Network Research Ethics Board (File: OHRI REB 20220112-01H).This study was a secondary data analysis; except for patient medical record review, all analyses involved deidentified data.The ethic board review waived the need to directly approach patients and retrieve consent for the study.The study period was 1 January 2009 to 1 June 2019.This period corresponded to the time that our hospital's operation registry (used for case identification) existed.

Health administrative datasets used for study
This study used five health administrative datasets (Table 1).The Surgical Information Management System (SIMS) dataset is a registry of all primary surgical procedures conducted at TOH.Following all operations, the surgical team enters the procedure that was performed.Completion rates for these forms are essentially 100% since hospital remuneration is a function of these data and nurses cannot close cases without these data.Data from these forms are inputted by hospital records staff into the operation registry with the primary procedure recorded by a 10-digit alpha-numeric code.The Discharge Abstract Database (DAD) is a population-based health administrative dataset that captures all Ontario hospitalizations, recording patient-based information (age, sex, pre-admission diagnostic codes [using the International Classification of Disease-10th Revision, ICD-10]) and hospitalization information (including admission service, procedural codes [using the Canadian Classification of Interventions, CCI] with dates, admission and discharge dates).The Ontario Health Insurance Plan (OHIP) database contains almost all surgeon service claims (by which surgeons are remunerated) that record the date and procedure type.The Ontario Cancer Registry (OCR) records the date and type of all index cancers diagnosed in Ontario.The Registered Persons Database (RPDB) records the death date of all Ontarians.DAD, OHIP, OCR, and RPDB are stored at ICES (formerly known as the Institute for Clinical Evaluative Sciences).ICES is an independent non-profit organization that houses population-based collections of health administrative datasets for the province of Ontario.

Case identification
Our first goal was to identify all cystectomies performed at TOH during the study period.This was done by querying SIMS for all potential primary cystectomy and urinary diversion procedures using the codes listed in Appendix A. To ensure complete capture of cystectomies, procedure codes for surgeries under which true cystectomies might be misclassified -such as partial cystectomy without urinary diversion or nephroureterectomy -were also included in the query (Appendix A).
True cystectomy-urinary diversion status of these potential cases were determined using manual chart review by a single reviewer (JR).This was determined by reviewing the operative note on the surgical date recorded in SIMS.If an operative note was incomplete, unclear, or missing, supplemental review of associated progress notes and discharge summary was performed.Patients who were classified with cystectomy were subclassifed with either continent or incontinent urinary diversion.The few patients who underwent cystectomy alone without urinary diversion (because both kidneys were simultaneously removed or the patient was already on dialysis for renal failure and anuric) were classified with incontinent diversion.Any case that was unclear regarding its cystectomy-urinary diversion status was reviewed with a second expert reviewer (LL) to provide a final opinion; this occurred in two cases.
We excluded patients less than 18 years of age and cases where cystectomy-urinary diversion was a secondary procedure and part of a larger surgery (such as a total pelvic exoneration for locally invasive rectal cancer).The latter cases were very uncommon and were excluded since they represent a patient cohort that was distinct from those having primary cystectomy-urinary diversion.Finally, patients without valid Ontario health card numbers were also excluded since they could not be linked to Ontario health data for analysis.
These steps identified all primary cystectomy cases at TOH during the study period; therefore, any TOH hospitalization that was not included in this group did not have a primary cystectomy.For all cystectomy cases, we recorded the: 1) Ontario health card number, 2) cystectomy date, and 3) diversion type (incontinent vs. continent).This reference cystectomy dataset was transferred to ICES via a encrypted data portal where the health card number was encrypted to permit linkage with ICES datasets.

Creating the Administrative Data Cystectomy Model (ADCM)
We created our study's analytical dataset by retrieving from the DAD all adult TOH hospitalizations during the study period.This dataset was linked to our reference cystectomy dataset via encrypted health card number and admission date to determine which TOH admissions truly had a cystectomy and, if so, its diversion type.
We reviewed CCI coding manuals to identify all CCI cystectomy codes (Appendix B).CCI codes used to identify cystectomy by urinary diversion within the DAD were reviewed with a Health Records expert at TOH to ensure completeness.Hospitalizations were classified as 'coded with cystectomy by diversion type' if they were assigned at least one CCI procedure code within the DAD (during the hospitalization) or at least one Ontario Health Insurance Plan (OHIP) billing code (with the service date being equal to the true operative date) for the cystectomy-diversion type.
We then reviewed the other co-variables in the DAD, OHIP database, and the OCR to identify patient-, hospital-, and procedure-level factors that might be associated with true cystectomy status.These potential covariates were ranked independently by two surgeons (JR, LL) based on their potential ability to identify cystectomy using health administrative data.These ranks were averaged to return the final covariate priority ranking (Appendix C).
To create the ADCM, multivariable multinomial logistic regression was used to model patient cystectomydiversion type status.We used the methods proposed by Riley et.al [9]. to determine the number of degrees of freedom that our model could contain.This calculation assumed that there would be 429,000 admissions to TOH during the study period with an estimated 250 cystectomies of each diversion type, giving a prevalence of each cystectomy-diversion type of 0.0583%.We then calculated the number of degrees of freedom (df) permitted in the regression model given four criteria: 10 outcomes per df (25 df permitted); mean error around individual predictions of ± 0.05% (42.6 df permitted); target shrinkage of 99.9% (32.1 df permitted); and target optimism in model fit of 0.0005 (106 df permitted).The final allowable degrees of freedom in the model for each cystectomydiversion type was the minimum of these calculations (25 df).
The outcome for the ADCM was true cystectomydiversion type status and it had three values: cystectomyincontinent diversion, cystectomy-continent diversion, or no cystectomy.The ADCM was constructed by adding covariates in rank order of perceived importance for cystectomy identification (Appendix C).Variables were retained if the likelihood ratio test following variable addition was significant at a p-value ≤ 0.05.We used a SAS macro described by Sauerbrei et.al [10].to identify best single fractional polynomial transformations for continuous candidate variables (age, acute length of stay, and operative time).Model building ended when all candidate variables (Appendix C) had been offered to the model.Creation and performance of the ADCM was reported using methods suggested by the TRIPOD statement (Appendix E).

Analysis
Model performance was internally validated using optimism-corrected c-statistic (for discrimination) and optimism-corrected integrated calibration index (ICI-for calibration) [11] using methods described by Steyerberg [12] with 1000 bootstrap samples.
To quantify misclassification bias, we first used the reference standard cystectomy-diversion type status to calculate true values of 30 statistics: cystectomy-diversion type prevalence in study cohort; the association of cystectomy with 3 continuous covariables [patient age, operation time, acute hospital length of stay] measured using linear regression; and the association of cystectomy with 27 binary covariables [sex, admission urgency, general anaesthetic status, transfusion status, discharge status, 28-day death or unplanned readmission status and 21 comorbidities from the Elixhauser morbidity scale] measured using logistic regression.With the exception of the Elixhauser comorbidities, these covariables did not require administrative database codes and are accurately measured in the DAD [13].
We then repeated the measurement of these 30 statistics after assigning cystectomy-diversion status using three methods: 1. CCI / OHIP Code for Cystectomy: Patients with a CCI or OHIP procedure code for cystectomy (Appendix B) were classified with cystectomy-diversion by code.2. ADCM-Categorical:Youden's method was used to determine the ADCM-based predicted probability of cystectomy by diversion type that optimized classification accuracy [14].This threshold corresponds to the predicted cystectomy-diversion type probability that is closest to the top left-hand corner of the corresponding receiver operating characteristic (ROC) curve.Patients with expected cystectomy probabilities equal to or above this threshold were classified with cystectomy-diversion by ADCM-categorical.3. ADCM-Bootstrap Imputation:This method used the ADCM predicted cystectomy probability to impute cystectomy status using bootstrap imputation (BI) [6][7][8]15].BI started by creating 1000 random bootstrap samples (with replacement) of the study cohort with each having a sample size identical to the original cohort.For each hospitalization within each bootstrap sample, a uniformly distributed number between 0 and 1 was randomly selected; cystectomydiversion status was then imputed to be present if the random number was below the ADCM-based predicted cystectomy-diversion type probability for that patient.Within each bootstrap sample, we measured all 30 statistics; the mean value of all 1000 bootstrap samples was used as the final BI point estimate and the 2.5th and 97.5th percentiles as the confidence intervals.
We quantified misclassification bias for each of these three methods to assign cystectomy-diversion status using the standardized mean squared error (SMSE): where: β is the parameter estimate of the covariable's association with cystectomy-diversion determined by the cystectomy-diversion status assignment method (CCI/OHIP code for cystectomy, ADCM-categorical, or ADCM-BI); and β T is the parameter estimate of the covariable's true association with cystectomy-diversion.We compared misclassification bias between cystectomydiversion status assignment methods using ANOVA on log transformed SMSEs of all 30 variables.Differences between assignment methods was determined using Tukey's studentized range test for ANOVA.All analyses were conducted using SAS 9.4.

Results
Query of the hospital's surgical database identified 668 possible total cystectomies (Fig. 1).Medical record review found that 158 (23.6%) were another procedure.Ten total cystectomies (2.0%) were excluded because patients did not have a valid health card number.This left 500 cystectomies in the study, to which 428 197 other adult hospitalizations during the study period were added.
The final cohort included 428 697 hospitalizations (Table 2).Patients were middle-aged (mean age 57.5 years) with the majority being female.A previous diagnosis of bladder cancer in the Ontario Cancer Registry or the Discharge Abstract Database was present in 1.5% and 0.7% of hospitalizations, respectively.A third of hospitalizations were elective and 5.0% were admitted under urology.Almost two thirds of patients (62.3%) were discharged home without supports.In less than 10% of hospitalizations did patients die or get urgently readmitted to any hospital in the month after discharge.
Compared to continent cystectomies, those with incontinent diversions were older (mean age 71.2 vs. 62.6 years), less likely to be previously coded with bladder cancer in OCR (78.8% vs 91.9%), were more likely to be coded with cystectomy-incontinent diversions (97.1% vs. 28.4%),and less likely to be discharged home without supports (12.9% vs. 21.6%)(Table 2).
The Administrative Data Cystectomy Model (ADCM) included 6 variables (Appendix D).Compared to hospitalizations without cystectomy, both cystectomy types were significantly more likely to have an elective admission, get admitted under urology, or experience an unplanned return to the operating room within 28 days.Cystectomy-incontinent diversions were more likely to be coded as such by either OHIP or CCI codes and had a significantly longer hospital length of stay.Cystectomycontinent diversions were more likely to be coded as such by either OHIP or CCI codes and have a previous diagnosis of bladder cancer in DAD.The ADCM had excellent discrimination (optimism-adjusted c-statistic, incontinent diversion: 0.9986 [95%CI 0.9964-0.9994];continent diversion: 1.0000 [95%CI 1.0000-1.0000])and calibration (optimism-adjusted ICI of 0.0000, 95%CI 0.0000-0.0000for both incontinent and continent diversions).Observed and expected cystectomy probabilities were very similar for both cystectomy types (Fig. 2).By Youden's method, the predicted cystectomy probabilities returning the most accurate classification for incontinent and continent diversions was 0.0922% and 0.0626%, respectively.Compared to codes for cystectomy-incontinent diversion (Appendix B), status assignment using this threshold increased sensitivity to 99.6% but decreased specificity from 100.0% to 99.6% and the positive predictive value to 14.6% (Table 3B).Compared to codes for cystectomy-continent diversion (Appendix B), status assignment using the ADCM-continent threshold maintained sensitivity and specificity at 100.0% while increasing positive predictive value to 73.5% (Table 3B).

Table 2 Study Cohort
The database source (

Discussion
This study found that cystectomy-diversion type status can be predicted with high accuracy using a multivariate model created with health administrative data (HAD).Imputing cystectomy-diversion status based on predicted probabilities from this model significantly decreased misclassification bias compared to identifying cystectomy using HAD codes alone.
Our study had several notable findings.First, we found that procedure codes for cystectomy (Appendix B) had sensitivities and specificities approaching 100% (Table 3).Given these operating characteristics, many researchers and readers would believe that misclassification bias would be minimal and would consider code status essentially equivalent to true cystectomy status.However, our data show this conclusion is incorrect since it ignores the very low prevalence of cystectomy (0.12% of all hospitalizations).As a result, these codes had low positive predictive values of 58.6% and 48.4% for incontinent and continent cystectomies, respectively (Table 3).Therefore, hospitalizations identified as having cystectomy with a particular diversion type using these codes had only about a 50% chance of truly having that procedure, resulting in notable misclassification bias (Figs. 2 and  3).These results highlight that it is essential to consider Table 3 Cystectomy classification accuracy with codes and using Administrative Data Cystectomy Model (ADCM) For cystectomy status classification by administrative database codes (Section A), patients were classified with cystectomy by diversion type if they met criteria listed in Appendix B. For cystectomy status classification using the ADCM (Section B), patients were classified with cystectomy-incontinent diversion if the expected probability from the ADCM was at least 0.0922%; patients were classified with cystectomy-continent diversion if the expected probability from the ADCM was at least 0.0626%.(Sens Sensitivity, Spec Specificity, + PV Positive predictive value, -PV negative predictive value, + LR Positive likelihood ratio, -LR Negative likelihood ratio).Values have been rounded to 1 decimal place.*As determined using Youden's method case prevalence when interpreting code accuracy measures.Second, we found that health administrative data can be used to create multivariate regression models which accurately predict procedure probability.The ADCM was extremely discriminative and well-calibrated (Fig. 2) to identify cystectomy-diversion status using health administrative data, despite the relative similarity of the two cystectomy subtypes.However, we also found that misclassification bias remained prominent when we determined cystectomy status by categorizing predicted ADCM probabilities using probability thresholds that maximized accuracy (Figs. 2 and 3).In contrast, misclassification bias was minimized when ADCM-based predictions were used to probabilistically impute cystectomy-diversion status.This phenomenon -wherein categorization of an accurate probability estimate causes misclassification bias -has been seen in previous studies [15].These results highlight the importance of applying predicted case status using probabilistic methods like bootstrap imputation.Several issues should be kept in mind when interpreting our results.First, although our study was done over an entire decade and included almost a half a million hospitalizations, it was completed exclusively at one center.Therefore, it would be important to repeat these methods at another hospital to ensure generalizability.Although health records abstractors -who assign diagnostic and procedural codes at each hospital in Canada -undergo similar training and follow the same coding guidelines, it is possible that coding practices might vary between hospitals.This could influence the performance of the ADCM, which is based partially on coded data.In addition, it would be important to measure the model's performance in hospitals that are distinct from that used to derive the ADCM including non-teaching institutions and hospitals having a much lower-or even absentincidence total cystectomy.Second, our data found that using codes to determine cystectomy status significantly increased misclassification bias compared to probabilistically imputing cystectomy status using the ADCM.However, we found no situations where cystectomy's association with a covariate was categorically opposite when its status was determined with codes versus its true value.Some would conclude that these results validate the use of codes to determine cystectomy status since doing so never returned incorrect binary conclusions.We argue that the goal of any scientific study is to return results that are as close to the truth as possible; therefore, imputing cystectomy status using an accurate prediction model is preferable.Third, our gold standard cohort only included patients who underwent cystectomy as a primary surgery.Therefore, patients undergoing cystectomy as a secondary operation as part of a larger surgery (such as pelvic exenterations) will be excluded.These patients, however, represent a separate cohort of patients with different outcomes as compared to those who underwent cystectomy as a primary operation; we therefore feel that their exclusion is appropriate.It is also possible that a primary cystectomy was miscoded in the SIMS database (Table 1) and would have been missed in the study.3 True cystectomy-incontinent diversion and its unadjusted association with other variables compared to cystectomy status determination using codes or the ADCM.Plots present cystectomy-incontinent diversion incidence (A) or the unadjusted association of the variable with cystectomy-incontinent diversion status determined using true status (TRUE), procedure codes in Appendix B (CODE), predicted cystectomy-incontinent probability from ADCM categorized to optimize accuracy (ADCM-CAT), or bootstrap imputation using ADCM-predicted cystectomy-incontinent probability (ADCM-BI) In summary, despite HAD-based procedure codes for cystectomy-diversion type having very high sensitivities and specificities, our study found that their use to impute cystectomy status resulted in important misclassification bias.This can be minimized by imputing cystectomy status using accurate cystectomy probabilities from the ADCM.Fig. 4 True cystectomy-continent diversion and its unadjusted association with other variables compared to cystectomy status determination using codes or the ADCM.Plots present cystectomy-continent diversion incidence (A) or the unadjusted association of the variable with cystectomy-continent diversion status determined using true status (TRUE), procedure codes in Appendix B (CODE), predicted cystectomy-continent probability from ADCM categorized to optimize accuracy (ADCM-CAT), or bootstrap imputation using ADCM-predicted cystectomy-continent probability (ADCM-BI)

Fig. 2
Fig. 2 Performance of the Administrative Data Cystectomy Models.In both plots, the vertical axis indicates the observed probability of cystectomy with incontinent diversion (A) or continent diversion (B), calculated using a LOESS model.The horizontal axis presents the expected probability of cystectomy with incontinent diversion (A) or continent diversion (B), calculated using the respective Administrative Data Cystectomy Model (Appendix D).Each plot presents the model's optimism-adjusted internally validated discrimination and calibration using the C-statistic (C-STAT) and the integrated calibration index (ICI), respectively

Fig.
Fig.3True cystectomy-incontinent diversion and its unadjusted association with other variables compared to cystectomy status determination using codes or the ADCM.Plots present cystectomy-incontinent diversion incidence (A) or the unadjusted association of the variable with cystectomy-incontinent diversion status determined using true status (TRUE), procedure codes in Appendix B (CODE), predicted cystectomy-incontinent probability from ADCM categorized to optimize accuracy (ADCM-CAT), or bootstrap imputation using ADCM-predicted cystectomy-incontinent probability (ADCM-BI)

Table 1
Description of datasets used for study OHIP (OntarioHealth Insurance Plan) All health care claims Procedure Determine cystectomy claim status for each hospitalization OCR (Ontario Cancer Registry) All Ontario primary cancers Case Determine bladder cancer status RPDB (Registered Persons Database) Demographic information of all Ontarians Person Death status within 28 days of hospital admission

Table 2 )
of each covariate is indicated ( a Discharge Abstract Database, b Ontario Cancer Registry, c Registered Persons Database) SD Standard deviation, % Percentage, OCR Ontario Cancer Registry, DAD Discharge Abstract Database, 1 Primary, OR Operating room, min minutes, IQR Interquartile range, d days, DC Discharge, LOS Length of stay